ABSTRACT

Context. Discovery of new variability classes in large surveys using multivariate statistics techniques such as clustering relies heavily
on the correct understanding of the distribution of known classes as point processes in parameter space. 

Aims. Our objective is to analyze the correspondence between the classical stellar variability types and the clusters found in the 
distribution of light curve parameters and colour indices of stars in the CoRoT exoplanet sample. The final aim is to help in the 
identification of new types of variability by first identifying the well known variables in the CoRoT sample.

Methods. We apply unsupervised classification algorithms to identify clusters of variable stars from modes of the probability density 
distribution. We use reference variability databases (Hipparcos and OGLE) as a framework to calibrate the clustering methodology. 
Furthermore, we use the results from supervised classification methods to interpret the resulting clusters. 

Results. We interpret the clusters in the Hipparcos and OGLE LMC databases in terms of large-amplitude radial pulsators in the clas- 
sical instability strip and of various types of eclipsing binaries. The Hipparcos data also provide clear distributions for low-amplitude 
nonradial pulsators. We show that the preselection of targets for the CoRoT exoplanet programme results in a completely different 
probability density landscape than the OGLE data, the interpretation of which involves mainly classes of low-amplitude variabil- 
ity in main-sequence stars. Our findings will be incorporated to improve the supervised classification used in the CoRoT catalogue 
production, once the existence of new classes or subtypes has been confirmed from complementary spectroscopic observations.

Key words. Methods: statistical; Methods: data analysis; (Stars:) binaries: eclipsing; Stars: variables: general; Stars: statistics;
Techniques: photometric 



1. Introduction 

In the past decade, a plethora of new light curves of celestial 
objects has become available to the astronomical community. 
The OGLE project in its second stage (OGLE II), for example, provided
astronomers with a total number of I-band magnitude time
series of the order of 40 million. Ongoing variability-focused
projects such as ASAS (Pojmanski 2002), MOST (Matthews
2007) or CoRoT (Fridlund et al. 2006), together with other future
survey databases such as those expected from the Panoramic
Survey Telescope & Rapid Response System (abbreviated as
Pan-STARRS), the Large Synoptic Survey Telescope (LSST)
or Gaia, will provide us with a unique opportunity to discover
new variability classes, either thanks to their unprecedented capability
to detect weak periodic signals (the MOST and CoRoT
cases) or to the large number of objects observed (Gaia, Pan-STARRS
or the LSST).

The discovery of new types of objects in large astronomical
databases is facilitated through the application of statistical techniques.
In particular, in the regime of very few cases in a potentially
new class, we would be dealing with a problem of outlier
detection, whereas a sufficient number of examples in the new
class allows for the use of clustering techniques for class discovery.
In both cases, a priori knowledge of the spatial distribution
of the known classes is essential.

In the general case, a variability survey can be seen as a col- 
lection of celestial objects for which we have determined a se- 
ries of parameters describing the variability in the time series. 
In this work we concentrate on periodic signals of variable stars 
as a starting point for understanding the variability diversity that 
can be found in any of the aforementioned surveys. Thus, typical
attributes used to describe the time series are the significant
detected frequencies, Fourier decomposition characteristics
(harmonic amplitudes, amplitude ratios, phase differences, etc.) 
and colour indices. An example of the list of attributes used for
supervised classification purposes can be found in Sarro et al.
(2009). Each variable star can then be represented by a point in
a multidimensional space, the axes of which are the attributes
mentioned above. We can view the database as a realization of
a random point process and we can approximate its originating
probability density function by means of parametric or nonparametric
density estimation methods. In the outlier detection
regime, exotic or unusual objects will thus be defined as occupying
regions of the parameter space characterized by low probability
densities. This further requires a deep understanding of the
noise properties in the random process, like spurious frequency
detections or the errors for low amplitude signals. In the clustering
regime, it is hoped that a good clustering algorithm is capable
of detecting the new class as a cluster.

* The CoRoT space mission was developed and is operated by the
French space agency CNES, with participation of ESA's RSSD and
Science Programmes, Austria, Belgium, Brazil, Germany, and Spain.

Pan-STARRS: http://pan-starrs.ifa.hawaii.edu/public/

A few clustering studies of variable celestial objects have 
been performed in the past. Eyer & Blake (2005) studied ASAS
variability data using Autoclass (Cheeseman et al. 1988). They
manually separated two types of variables according to the reg- 
ularity of the light curves. Some 45% of the total sample of 
1731 stars was considered to have a sufficiently regular be- 
haviour and was classified using Fourier decomposition coef- 
ficients. The remaining 55% was characterized by their period 
and second, third and fourth moments of their light curve dis- 
tribution. In the first group, Autoclass identified nine clusters. 
These were linked with eclipsing binaries (three clusters, two 
of which are relatively pure and a third one mixed with ambigu- 
ous light curves and wrong periods), Cepheids (two clusters, one 
with clear candidates, the other with dubious cases), RR Lyrae 
pulsators (one cluster with only four stars), Small Amplitude
Variables (a generic name for a mixture of low signal-to-noise 
variables), and a last group without interpretation. Judging by 
the plots in Eyer & Blake (2005), the latter cluster is characterized
by periods in the range 0.1 to 40 days and low R21 values,
where R21 is defined as the ratio between the amplitude of the 
first harmonic and the amplitude of the frequency itself, for the 
dominant frequency. Moreover, the corresponding phase difference
of these two Fourier terms, φ12, has random values, which
points to almost sinusoidal light curves.
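To make the quantities R21 and φ12 concrete, the following minimal Python sketch derives both from a phase-folded light curve. It is an illustration only, under stated assumptions: the phases are taken to be evenly spaced, so that simple projection sums can replace a least-squares fit, and the convention φ12 = φ2 - 2φ1 (wrapped to [-π, π)) is an assumption that may differ from the one used in the cited works. The function name `harmonic_attributes` is hypothetical.

```python
import math

def harmonic_attributes(phases, mags):
    """Return (A1, A2, R21, phi12) for a light curve folded on its dominant frequency.

    Assumes phases evenly spaced in [0, 1), so that each harmonic can be
    obtained by projection onto sine and cosine (a discrete Fourier sum).
    """
    n = len(phases)
    mean = sum(mags) / n
    amps, phis = [], []
    for j in (1, 2):  # j = 1: dominant frequency, j = 2: its first harmonic
        s = sum((m - mean) * math.sin(2 * math.pi * j * p) for p, m in zip(phases, mags))
        c = sum((m - mean) * math.cos(2 * math.pi * j * p) for p, m in zip(phases, mags))
        a, b = 2 * s / n, 2 * c / n  # a*sin + b*cos = A*sin(2*pi*j*p + phi)
        amps.append(math.hypot(a, b))
        phis.append(math.atan2(b, a))
    r21 = amps[1] / amps[0]
    # Assumed convention: a phase difference that does not depend on the time origin.
    phi12 = (phis[1] - 2 * phis[0] + math.pi) % (2 * math.pi) - math.pi
    return amps[0], amps[1], r21, phi12

# Synthetic, noise-free light curve with A1 = 1.0 and A2 = 0.4.
phases = [i / 200 for i in range(200)]
mags = [math.sin(2 * math.pi * p + 0.3) + 0.4 * math.sin(4 * math.pi * p + 1.1)
        for p in phases]
a1, a2, r21, phi12 = harmonic_attributes(phases, mags)
```

For a nearly sinusoidal light curve the harmonic amplitude A2 is close to zero, so its phase, and hence φ12, is dominated by noise, which is the behaviour described for the uninterpreted cluster.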

Brett et al. (2004) applied Self-Organized Maps (hereafter
SOMs) to study the clustering structure of ROTSE preclassified
light curves. They also applied the algorithm to an artificial
dataset of light curves for calibration purposes. They folded the
time series with the period and binned the observations in a fixed
number of phase bins. The training of the SOM was carried out
on a set of light curves previously classified in one of the following
categories: detached eclipsing binaries, contact eclipsing
binaries, δ Scuti stars, RR Lyrae ab, RR Lyrae c, and Cepheids.
The results indicate that, with this approach and dataset, SOMs
are capable of separating detached binaries, contact binaries and
RR Lyrae ab stars in three distinct categories, whereas RR Lyrae
c, δ Scuti stars and Cepheids are mixed in one cluster. As the authors
point out, the lack of information on the period is the main
cause for this confusion of otherwise easily separable classes.
The SOM approach has several major drawbacks. First of all, it 
does not incorporate information known to be relevant for the 
separation of clusters (e.g., the harmonics of detected frequen- 
cies). Also, the binning process renders the method inapplica- 
ble for databases where incomplete phase coverage is common 
(e.g., the Hipparcos and Gaia databases). Further, it is not ro- 
bust against spurious frequency detections, a common problem 
with eclipsing binaries where identical eclipses result in a de- 
tected frequency which is twice the orbital one. The folded light 
curve will thus not resemble the eclipsing binaries prototypes. 
Finally, the results have been obtained with a dataset which is 
the result of intensive selection, i.e., only high signal-to-noise 
variables with well determined periods and clear class assign- 
ments were used in the training of the SOM. 
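The folding and binning step discussed above can be sketched as follows. This is a minimal illustration, not the preprocessing code of the cited study; the function name `fold_and_bin` and the choice of averaging magnitudes within each bin are assumptions. Empty bins are returned as `None`, which makes the incomplete phase coverage problem visible.

```python
def fold_and_bin(times, mags, period, n_bins=10):
    """Fold a time series on `period` and average the magnitudes in equal-width phase bins.

    Returns a list of length n_bins; bins without observations are None.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, m in zip(times, mags):
        phase = (t / period) % 1.0                 # phase in [0, 1)
        k = min(int(phase * n_bins), n_bins - 1)   # guard against rounding up to n_bins
        sums[k] += m
        counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Toy example: four observations, period of 2 d, four phase bins.
times = [0.1, 0.6, 2.1, 3.6]
mags = [10.0, 11.0, 12.0, 13.0]
binned = fold_and_bin(times, mags, period=2.0, n_bins=4)
```

Folding an eclipsing binary with half its orbital period superimposes the primary and secondary eclipses in the same bins, which is why the resulting vector no longer resembles the class prototype.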

In this work we characterize the distribution of variable ob- 
jects in parameter space. Natural groups formed in several refer- 
ence variability databases, such as the Hipparcos, the OGLE II 
Large Magellanic Cloud, and the CoRoT exoplanet programme 
are considered. Any clustering analysis is intrinsically limited 
by the sampling properties of the database to which it is applied. 
Thus, we can only hope to describe the realm of stellar variabil- 
ity to the limits of each experiment and the algorithms used to 
describe it, i.e., its limiting magnitude, sampling properties, sensitivity
and the efficiency of the frequency detection algorithms.
For example, the capability of characterizing weak multiperiodic
signals in faint stars in the Magellanic Cloud is limited, and clustering
results in that regime cannot produce clean samples for
this database. In our approach, we keep all variable stars in the
database for which at least one statistically significant frequency
was detected, regardless of whether they have been previously classified
in other works or not. Our sample thus is as realistic with
respect to the original probability density function as the data
permit it to be. The goal is to understand as deeply as possible
this point distribution, in order to be prepared for class discovery
in the presently considered and other forthcoming variability
databases resulting from Gaia, Pan-STARRS or LSST.

2. Elements of a clustering approach 

There are many theoretical approaches to clustering in the spe- 
cialized statistics literature. In astronomy, the two most popular 
clustering techniques are Bayesian parametric modelling (e.g., 
the Autoclass implementation) and SOMs. There are several reviews
describing unsupervised classification or clustering techniques
(see e.g. Jain et al. 1999). A description of recent advances
in the field is beyond the scope of this article, but we believe
that a brief discussion of some of the main issues involved
in this kind of multivariate analysis is necessary for the subse- 
quent understanding and interpretation of our results. 

Clustering techniques can be put into several categories de- 
pending on the viewpoint used to characterize them. For ex- 
ample, they can be divided into parametric and nonparametric 
techniques according to the hypothesis underlying the method. 
Nonparametric techniques make no a priori assumption regarding
the shape of the clusters that constitute the dataset. Good examples
of nonparametric techniques are density based clustering
techniques that identify cluster centres as the modes in a kernel
based estimate of the point density. Obvious drawbacks of this
approach are its incapacity to detect low contrast overdensities
in the vicinity of dominant clusters if they do not result in local
maxima, and the limitations inherent to the dependence of the
bias-variance trade-off on the kernel widths. Other nonparametric
techniques include hierarchical clustering and Nearest Neighbours
related algorithms. 

On the other hand, parametric techniques are developed un- 
der the hypothesis that the database is a sample from a probabil- 
ity density distribution that either is a functional form (model) 
of the parameters, or can be approximated with a number of 
such functions. The simplest (yet useful) model is to assume that 
the data are generated from a distribution that is a linear combi- 
nation of multivariate normal distributions with unit covariance 
matrices. This results in spherical clusters and linear boundaries 
between clusters. This simple model has the mean of the dis- 
tribution and the width of the multivariate normal as parame- 
ters, and a popular algorithm to find them is the Expectation- 
Maximization algorithm that maximizes the likelihood of the 
data. The model can be made more flexible and sophisticated 
by allowing for more complex covariance matrices. This translates
into clusters with unequal axes in arbitrary orientations
(i.e., not necessarily aligned with the parameter space axes). This
is the case of Autoclass, except for the fact that it goes beyond 
maximum likelihood estimation by introducing priors for the pa- 
rameters, thus turning into a fully Bayesian inference. 
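As an illustration of the simplest parametric model mentioned above, the sketch below runs Expectation-Maximization on a one-dimensional mixture of unit-variance Gaussians. Several simplifications are assumed for brevity and are not taken from the text: the data are one-dimensional, the mixing weights are equal and fixed, and only the component means are estimated. The function name `em_unit_variance_mixture` is hypothetical.

```python
import math

def em_unit_variance_mixture(data, means, n_iter=50):
    """EM for a 1-D mixture of unit-variance Gaussians with equal, fixed weights.

    Only the component means are updated (a deliberately minimal model).
    """
    means = list(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            w = [math.exp(-0.5 * (x - m) ** 2) for m in means]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: each mean becomes the responsibility-weighted average of the data.
        for k in range(len(means)):
            weight = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / weight
    return means

# Two well separated groups centred on 0 and 10.
data = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
means = em_unit_variance_mixture(data, means=[1.0, 8.0])
```

Allowing full covariance matrices and free weights adds the corresponding updates to the M-step but leaves the alternating structure unchanged.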

One advantage of parametric models is that they permit the 
detection of clusters not associated with local maxima, such as 
those missed by density based methods. Unfortunately, the evident
disadvantage is that, if the underlying assumption is incorrect
or insufficient (i.e., if the clusters are not samples from
the hypothesized parametric model), the resulting clustering is
uninterpretable. This drawback is not present in nonparametric
clustering, where clusters of arbitrary shape can be detected.
Obviously, it is always possible to reconstruct the real shape of a
cluster by adding more and more model components (multivariate
Gaussians being the most popular choice), but then the possibility
to detect new stellar populations is compromised because the
clusters are no longer linked to differences in the point samples, 
but are created by the inability of the algorithm to fit the real 
distribution of points. 

Two other elements, central to the problem of finding a good 
clustering description of a dataset, are the automatic determina- 
tion of an optimal number of clusters, and that of feature se- 
lection. The former is closely related to the clustering evalu- 
ation measures, since choosing a number of clusters relies on 
the availability of an objective measure of clustering quality. In 
the framework of parametric clustering, the classical tools for 
model selection are readily available. In the Bayesian approach, 
the Bayes factor is commonly used, which incorporates a natural
Occam's razor by penalizing solutions with an excessive number
of parameters (clusters). Obviously, one can find a perfect
match to any density distribution by adding more and more components
to the mixture model. This would be the equivalent to
the concept of overfitting in supervised classification or regres- 
sion problems. Other ways to assess clustering quality related 
to maximum likelihood estimators are the Bayesian Information 
Criterion (BIC) and Akaike Information Criterion (AIC), or the 
Median Split Silhouettes (MSS; Pollard & Laan 2002) and the
gap statistic (Tibshirani et al. 2000), which can also be applied
to nonparametric clustering. 
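For reference, the two information criteria named above follow directly from the maximised log-likelihood ln L, the number of free parameters k and the number of data points n: AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The short sketch below applies them to select the number of mixture components; the log-likelihood values are hypothetical and serve only to show the penalty at work.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_points):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_points) - 2 * log_likelihood

# Hypothetical maximised log-likelihoods for 1-D mixtures with K components,
# each with 3K - 1 free parameters (K means, K variances, K - 1 weights).
n_points = 1000
fits = {2: -1500.0, 3: -1498.0}
scores = {k: bic(ll, 3 * k - 1, n_points) for k, ll in fits.items()}
best_k = min(scores, key=scores.get)
```

In this toy case the third component raises the log-likelihood by only two units, which does not compensate for the three additional parameters, so the two-component model is preferred.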

The other critical aspect of clustering experiments is the se- 
lection of an optimal attribute set, capable of incorporating the 
attributes where clustering occurs but, at the same time, discarding
those irrelevant to the problem. Unfortunately, this field
has not evolved as much in unsupervised as in supervised classification,
and there is no commonly accepted technique to perform
attribute selection. In the experiments described in Sect. 4
we selected the attributes manually based on past experience
in supervised classification. 

One last aspect that has to be considered in the selection of 
an algorithm for the clustering analysis of variability data is that
of scalability. Both the time and space complexities of the algorithms,
and the possibility of parallelizing the data processing
have to be considered, especially if the algorithms have to be applied
to large databases such as those expected in the next few
years. 

This summary is not intended to be exhaustive but rather
to provide an introduction to the key aspects of the problem of class
discovery in variability databases. There is simply no best clus- 
tering methodology for all types of problems. We stress the 
importance of making both the assumptions underlying each 
method and the evaluation criteria explicit before a clustering 
experiment is designed. 

3. Adopted methodology 

In designing the experiments and methodology described below, 
we imposed several requirements. The solution proposed here is 
an attempt to fulfil all of the following requirements: 

1. Interpretability. We consider that no class discovery is possible
if the clusters obtained by a given method are not interpretable
in terms of the expert's domain knowledge. In this
case, the domain is that of regular variable stars and this
means that the clusters defined by the method have to recover
the classical variability types reasonably well. This requirement
obviously has to be related to the parameter space
where clustering is performed: it cannot be hoped to separate
variability classes only differing in some spectroscopic
signature (e.g., due to metallicity effects) if the parameter 
space is built only with the properties of a photometric time 
series. 

2. The algorithms involved in the clustering process have to 
be scalable to the orders of magnitude typical of the era of 
astronomical surveys. Nowadays, large scale variability sur- 
veys like those expected to come from Gaia, Pan-STARRS 
or LSST, will be producing parameters for ^ 10^ objects or 
more, and class discovery techniques have to be capable of
dealing with these enormous databases.

3. The methodology has to be model free or, at least, use the 
most flexible models available, in order to avoid that the clus- 
ters obtained are dictated by the model selected. 

In the following, we describe the clustering methodology
adopted in the performed experiments. We decided to use the
so-called density based approach to clustering, whereby clusters
are identified with maxima of the density distribution, and points
are assigned to their closest maximum or mode. This density is
never actually computed and points are assigned to modes (clusters)
by using the Modal EM (Expectation Maximization, MEM)
algorithm (Li et al. 2007), which can be summarized as follows.
Let $S = \{x_1, x_2, \ldots, x_n\}$ be the set of objects $x_i \in \mathbb{R}^d$ to be clustered.
The $d$ components in each $x_i$ are the attributes or parameters
used to characterize the variable objects (frequencies, amplitudes,
phase differences, colour indices, etc.; see below). Let
$f(x) = \sum_{k=1}^{K} \pi_k f_k(x)$ be the density of points approximated by a
Mixture Model with $K$ components. In our case, we used multivariate
normal distributions $N(x \mid \mu, \Sigma)$, with mean $\mu$ and covariance
matrix $\Sigma$. Then, given a point $x$, the MEM algorithm consists
in alternating the following two steps, starting with $r = 0$ and
$x^{(0)} = x$:

1. Expectation: compute the posterior probability of each mixture
component, $p_k = \pi_k f_k(x^{(r)}) / f(x^{(r)})$.

2. Maximization: update $x^{(r+1)} = \arg\max_x \sum_{k=1}^{K} p_k \log f_k(x)$.

When applied to one object in the database defined by $x_i$, the
MEM algorithm converges to a local maximum of the density
estimate. Clustering is then accomplished by grouping points
that converged to the same mode (to within some tolerance).
Our starting point is a kernel density estimate such that, if the
database contains $n$ objects, $f(x)$ is defined by a mixture model
with $n$ components, one associated with each object:

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} N(x \mid x_i, \Sigma). \qquad (1)$$
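The two MEM steps can be sketched in a few lines of Python for the kernel density estimate of Eq. (1). The sketch is a simplified illustration and not the implementation used in this work: it assumes an isotropic covariance σ²·1 common to all kernels, in which case the maximization step reduces to the posterior-weighted mean of the kernel centres, and it evaluates all kernels by brute force rather than with kd-trees. The names `modal_em` and `cluster_by_mode` and the merging tolerance are hypothetical.

```python
import math

def modal_em(x, points, sigma, tol=1e-8, max_iter=1000):
    """Climb from `x` to a mode of an isotropic Gaussian kernel density estimate.

    With equal weights and a common covariance sigma^2 * 1, the M-step is
    the posterior-weighted mean of the kernel centres.
    """
    x = list(x)
    for _ in range(max_iter):
        # E-step: posterior probability of each kernel given the current x.
        w = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * sigma ** 2))
             for p in points]
        total = sum(w)
        post = [wi / total for wi in w]
        # M-step: weighted mean of the kernel centres.
        new_x = [sum(pk * p[d] for pk, p in zip(post, points)) for d in range(len(x))]
        shift = math.dist(new_x, x)
        x = new_x
        if shift < tol:
            break
    return x

def cluster_by_mode(points, sigma, merge_tol=1e-3):
    """Group points whose MEM ascents end at the same mode (within merge_tol)."""
    modes, labels = [], []
    for p in points:
        m = modal_em(p, points, sigma)
        for k, known in enumerate(modes):
            if math.dist(m, known) < merge_tol:
                labels.append(k)
                break
        else:  # no known mode is close enough: register a new one
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels

# Two compact groups of three points each in the unit square.
points = [(0.10, 0.10), (0.12, 0.08), (0.08, 0.12),
          (0.90, 0.90), (0.88, 0.92), (0.92, 0.88)]
modes, labels = cluster_by_mode(points, sigma=0.1)
```

The brute-force evaluation costs O(n) kernel evaluations per step and per object, which is what motivates a spatial index for larger samples.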



Given that the implementation has to be capable of processing
datasets with a number of instances of the order of several
times $10^4$ (in these experiments) and larger, the implementation
makes use of k-dimensional trees (kd-trees hereafter; see e.g.
Friedman et al. 1977 and Bentley 1980).

It is evident that the MEM algorithm involves an approximation
to the density which, in this case, is a kernel based approximation.
Therefore, depending on the kernel bandwidth used
in the estimate, we will have different sets of modes: fewer modes
and larger clusters for a large kernel width, and many small clusters
when the kernel bandwidth is small. At each scale we are
analysing different aspects of the dataset but, as the kernel width 
is made smaller, the modes of the density distribution can be 
made to correspond to local maxima due to noise in the finite 
sampling of the point process. Therefore, the smallest scales one 
is ready to accept have to be judged on physical terms in order 
to establish the significance of the clusters obtained. 

In our case, we carried out the analysis in a hierarchical man- 
ner, starting with large kernel bandwidths to define the large 
groupings in the data. Thereafter, we decreased the widths until 
we could no longer establish the meaningfulness of the largest 
clusters obtained. We also implemented adaptive kernel widths 
in order to avoid large variances in the clustering results in the 
low density regions. The kernel width adaptation takes into ac- 
count the local density of points in order to increase the width in 
regions where the nearest neighbours are at large distances and 
thus, fixed kernel estimation would result in a local maximum of 
the density at each observed object. 

Since these experiments were designed to serve as the ba- 
sis for subsequent class discovery, we concentrated the descrip- 
tion of the clusters in terms of known classical variables. Thus, 
the clustering decomposition (the decrease in the kernel band- 
widths) stopped when we no longer could interpret the resulting 
clusters. 



4. Performed experiments 

In the following, we describe the experiments we performed
with three databases: the Hipparcos variability archive
(Perryman & ESA 1997), the OGLE database of Large
Magellanic Cloud variables (Zebrun et al. 2001), and the CoRoT
database corresponding to the first four runs (IRa01, SRc01,
LRc01, and LRa01; Debosscher et al. 2009). These databases
are characterized by very different statistical properties. While
the sampling is quite regular but drastically different in the
OGLE and CoRoT databases, this is less so in the Hipparcos
case. The number of cases in each sample is also very different:
43351 variables were retrieved from the OGLE LMC database,
2419 variables from Hipparcos and 14642 from CoRoT (see below
for selection criteria). Finally, colours are available for very
large fractions of the OGLE and Hipparcos databases whereas 
the CoRoT sample is analysed only based on parameters derived 
from the time series. 

The figures given above for the number of objects in each 
database are the result of selecting sources with some predefined 
criteria. For both the OGLE and Hipparcos databases of variable 
stars, we selected objects with available V - I colour index. For 
the Hipparcos and CoRoT databases we also requested p-values
in the frequency detection phase below 10 (see Debosscher
et al. 2009 for a definition of the p-values).

The original time series in each database was processed in
order to detect statistically significant frequencies. The procedure
was described in detail in Debosscher et al. (2007), Sarro
et al. (2009), and Debosscher et al. (2009) and is therefore omitted
here. We only retain five relevant attributes in the analysis
below, namely the first detected frequency (denoted as ν in the
OGLE and Hipparcos cases and as ν1 in the CoRoT case and
expressed in d^-1 throughout the paper), the amplitude of the
first component in the Fourier decomposition of this frequency
A11, the R21 ratio which stands for the ratio of the amplitudes
of the first two components in the Fourier decomposition of the
first frequency, the phase difference φ12 between the second and
first harmonic components in the Fourier decomposition and (in
the OGLE and Hipparcos cases) the V - I colour index. In the
CoRoT case, given the quality of the time series, we used the amplitudes
of the first four components in the Fourier decomposition
of the first detected frequency, and the first two of the second
detected frequency ν2. Further attributes can be added in order to
look deeper into the composition of the clusters found. This includes
parameters related to higher orders in the Fourier decomposition
and additional undereddened colour indices if available,
such as J - H, or H - K. The latter only make sense for selected
clusters where multiperiodicity or infrared emission are
attributes expected to result in clustering structures at smaller
scales.

In the following, we present and describe clustering results 
obtained with the density based methodology described in section 3.
As explained above, clusterings with different levels of
detail can be obtained with different kernel bandwidths (the 
covariance matrix used in the normal distributions placed at each 
point). Thus, starting with a large kernel bandwidth (resulting 
in few large clusters), it is possible to refine and subdivide the 
clusters obtained by using increasingly smaller bandwidths. The 
resulting structure resembles a tree (the dendrogram) where clus- 
ters (branches) subdivide into smaller clusters at smaller scales. 

The covariance matrix can be chosen isotropic ($\Sigma = \sigma^2 \cdot \mathbb{1}$,
with $\sigma^2$ being the tuning parameter effectively controlling
the kernel bandwidth) or, more generally, anisotropic. In this
work we have made the diagonal elements proportional to the
nearest neighbour distance. The proportionality is introduced by
normalizing the covariance matrix such that the maximum along
the diagonal is made equal to the tuning parameter $\sigma^2$.
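A rough sketch of a nearest-neighbour based kernel width is given below. It is only one simplified reading of the scheme described above, not the implementation used here: each object receives a single isotropic variance proportional to the Euclidean distance to its nearest neighbour, and the variances are rescaled so that the largest one equals σ². The function name `adaptive_variances` is hypothetical.

```python
import math

def adaptive_variances(points, sigma):
    """Per-object kernel variance proportional to the nearest-neighbour distance,
    rescaled so that the largest variance equals sigma**2.

    Assumes at least two distinct points; duplicated points would receive a
    zero variance and need a floor in practice.
    """
    nn = []
    for i, p in enumerate(points):
        nn.append(min(math.dist(p, q) for j, q in enumerate(points) if j != i))
    largest = max(nn)
    return [sigma ** 2 * d / largest for d in nn]

# Three close objects and one isolated object.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0)]
variances = adaptive_variances(points, sigma=0.2)
```

The isolated object receives the full variance σ², while the objects in the dense group receive much narrower kernels, so the low density region is smoothed more strongly than the high density one.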

The parameters selected for the clustering process and the 
level at which the dendrogram is cut are mostly dictated by the 
interpretability of the clusters and clarity of plots. While it is 
possible to present purer clusters (i.e., with more homogeneous 
components) it is usually at the expense of a larger number of 
clusters. It is important to remark that each dimension of the data 
was normalized in the range [0, 1] to prevent attributes with larger
absolute values from dominating the kernel variance. Therefore, the
input data to the clustering algorithm are contained inside a
unit hypercube. This means that the quoted values of σ (the
kernel bandwidth) are relative to this unit length defining the 
range in each dimension. The values used for each experiment 
(Hipparcos, OGLE LMC and CoRoT) were 0.2, 0.15 and 0.2 re- 
spectively. While, as explained above, the availability of colour 
indices and class labels in the first two experiments guided us in 
the selection of the actual value of the kernel width, in the case 
of the CoRoT database it was based on the visual inspection of 
density contour and scatter plots. The clumpiness of the distribution
of data points in parameter space was somewhat less than
for either of the other two experiments, suggesting a larger value of
σ. We nevertheless decided to maintain the same value σ = 0.2,
given the unprecedented quality of the CoRoT data. 
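The rescaling of each attribute to the range [0, 1] mentioned above amounts to a column-wise min-max normalization, sketched below. The function name `normalise_unit_hypercube` is hypothetical, and mapping a constant attribute to 0 is an arbitrary choice made here to avoid a division by zero.

```python
def normalise_unit_hypercube(rows):
    """Rescale every attribute (column) linearly to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Two attributes with very different absolute scales.
rows = [[0.5, 100.0], [1.5, 300.0], [2.5, 200.0]]
unit = normalise_unit_hypercube(rows)
```

After this step a kernel width such as σ = 0.2 has the same meaning along every axis, namely one fifth of the full range of the attribute.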

Plots of the clusters will be numbered and referred to according
to the cluster identifier.

4.1. The Hipparcos archive

The entire Hipparcos variability catalogue was processed as described
in Debosscher et al. (2007), and light curves with p-values,
as defined in Debosscher et al. (2009), below 10^ were retained
for clustering analysis. Furthermore, we made use of class
labels in order to interpret the resulting clusters. Figure 1 represents
the 16 clusters found in this set of objects in the log(ν)-(V - I)
plane, using a value of the kernel width σ = 0.2. Other
projections of the original database using the other attributes can
be found in Figures [r8p6]. For the discussion of the Hipparcos
database, we make use of the class labels from the General
Catalogue of Variable Stars (GCVS; Samus et al. 2002) because
this was also done by the Hipparcos team (e.g. Eyer & Grenon 1997).

From Fig. 1 we deduce that the algorithm separated long period
(i.e., low frequency) variables of similar properties but with
different behaviour in the values of φ12. Examples are clusters 7
and 10 which share similar properties and are only separated in
the value of the φ12 of their modes. This is obviously an artefact
due to the use of the quantity φ12 as an attribute, while this quantity
is invariant under shifts of 2π. The phase difference φ12 is
only useful for a few classes, like RR Lyrae pulsators, Cepheids
or eclipsing binaries because they show highly non-sinusoidal
light curves. For long period variables (classes M, SR, I and L
in the nomenclature of the GCVS) or low-amplitude pulsators,
we find a more homogeneous coverage of the range of values of
φ12, in agreement with the physics of their oscillations and their
nearly sinusoidal variations in the light curve which implies a
large uncertainty in the R21 and φ12 values.
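The 2π artefact can be illustrated with a short sketch: two phase differences on either side of the wrap-around point are far apart on a linear axis but nearly identical as angles. The helper names below are hypothetical, and the circular distance and unit-circle embedding are shown only as common ways to handle angular attributes; they are not procedures used in this work.

```python
import math

def circular_difference(phi_a, phi_b):
    """Smallest signed difference between two angles, in [-pi, pi)."""
    return (phi_a - phi_b + math.pi) % (2 * math.pi) - math.pi

def embed_phase(phi):
    """Map an angle to a point on the unit circle, which has no 2*pi jump."""
    return (math.cos(phi), math.sin(phi))

# Two phase differences that look far apart on a linear axis...
phi_a, phi_b = 0.05, 2 * math.pi - 0.05
linear_gap = abs(phi_a - phi_b)
# ...but are close once the periodicity is taken into account.
circular_gap = abs(circular_difference(phi_a, phi_b))
embedded_gap = math.dist(embed_phase(phi_a), embed_phase(phi_b))
```

A kernel density estimate built on the linear axis sees two separate overdensities at the two ends of the range, which is how a single physical group can be split into two clusters.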

Further, the algorithm succeeds in the recovery of the most 
conspicuous groups of variables: cluster 1 contains most of 
the Cepheids in the Hipparcos catalogue, cluster 13 most of 
the RR Lyrae stars, cluster 4 the binaries of the EB type, and 
cluster 6 binaries of the EA and EB types. The Long Period 
Variables group (LPVs) includes as most numerous subgroups 
the Miras (M), Semiregular Variables (SR), Irregular Variables 
(I) and Slow Irregular Variables (L). These subtypes are spread 
over clusters 2, 5 and 12 (SR, I, and L), and 3, 7, 10, and 11 
(Mira Variables with negligible contributions from other LPVs). 
While cluster 12 is characterized by periods of the order of sev- 
eral hundred days, clusters 2 and 5 show a much larger spread 
of periods down to a few days. Cluster 3 contains the lower-amplitude
Miras, whose φ12 values are spread out compared to
those for the Miras in clusters 7, 10, and 11. The latter are artificially
divided due to the 2π degeneracy in φ12 on the one hand
and due to redder colour indices on the other hand. 

We interpret the separation of Miras into four distinct clusters
as the result of a finite sample, producing local low contrast
maxima at random locations. A larger kernel width would
smooth out these small overdensities at the expense of merging
neighbouring clusters. The obvious solution to this is to cut
the clustering dendrogram at different levels in each branch.
Furthermore, this allows for the selection of relevant attributes
in each branch, the necessity of which is exemplified by the segregation
of groups introduced by an attribute such as φ12 which
is not relevant for all classes of variables but very discriminating
for others.

Cluster 14 contains most of the δ Sct stars in the sample
while all other contributions to this cluster are small. The second
most numerous contributing class are the SX Phe stars with only
three objects. This is fully understandable from a physical point
of view, as the SX Phe stars' variability behaviour is very similar
to that of the high-amplitude δ Sct stars but they are old Population II
stars instead (Rodríguez & López-González 2000). Most other
SX Phe stars constitute a satellite group of the RR Lyrae stars in
cluster 13 (these two groups separate for smaller values of the
kernel width) and are not assigned to the δ Sct cluster because
they have larger amplitudes than the prototypical Population I
δ Sct stars and only somewhat smaller than the Population II
RR Lyrae stars (see Fig. 18). It is very reassuring that the algorithm
manages to distribute the SX Phe stars over the two other
clusters whose physical properties they share.



Cluster 15 is composed of rather strictly periodic blue stars. These
are pulsating stars of spectral type B belonging to the β Cep, slowly
pulsating B (SPB) and periodically variable supergiant (PVSG) classes
discovered by means of multivariate discriminant analysis by Waelkens
et al. (1998) and confirmed from ground-based follow-up data by Aerts
(2000), Aerts et al. (1999), and Lefever et al. (2007). These stars
are in general nonradial multiperiodic oscillators with low-order
pressure (β Cep) or gravity (SPB, PVSG) modes excited by the
κ mechanism. Their amplitudes are not too different from those of the
δ Sct stars, even though some of the class members show somewhat more
pronounced, albeit small, nonlinear effects in their oscillations.
Their periods are typically an order of magnitude longer than those of
the δ Sct stars, as illustrated in Fig. 1. The ground-based follow-up
studies have shown that some of these candidate pulsating B stars are
in fact ellipsoidal binaries (De Cat et al. 2000) or spotted stars
(Briquet et al. 2004) with periods similar to the oscillation periods
of SPBs, but this was the case for a small minority only.

Cluster 9, on the other hand, contains γ Dor stars. These are F-type
multiperiodic stars pulsating in nonradial gravity modes with periods
similar to those of the SPBs, i.e., also an order of magnitude longer
than those of the δ Sct stars. They are situated on the red side of
the classical instability strip. Numerous new class members were found
in the Hipparcos database by Aerts et al. (1998) and Handler (1999).
Most of these were later confirmed as class members from extensive
follow-up studies (Aerts et al. 2004; Mathias et al. 2004; De Cat et
al. 2006), just as for the B-type pulsators discussed above. The
amplitude and phase behaviour of the γ Dor stars is similar to that of
the δ Sct, β Cep and SPB stars, taking into account the 2π degeneracy
for φ12.

Finally, clusters 8 and 16 are dominated by so-called γ Cassiopeiae
stars (GCAS in the GCVS nomenclature). The star γ Cas is a member of
the classical Be stars, which are objects showing Balmer line emission
in their spectrum due to the presence of a circumstellar disc (see
Porter & Rivinius (2003) for an extensive review of the observational
and physical properties of those stars). This inhomogeneous class
consists of both single and binary stars and, moreover, some Be stars
show oscillations and/or light curve trends while others do not. It is
thus to be expected that these objects are spread out over various
clusters. Clusters 8 and 16 contain 32 and 17 such stars,
respectively. The two cluster modes differ only by an offset of 2π in
the phase difference φ12, which is again artificial. Cluster 15,
described above, also contains 9 Be stars according to the Hipparcos
classification.

We conclude that our clustering analysis is a powerful method to
separate the monoperiodic large-amplitude classical variables and
binaries from the low-amplitude multiperiodic nonradial pulsators.
Since these populations are well represented in the Hipparcos database
of variables, we have shown that our methodology has the potential to
recover all these classes in unexplored new databases as well, by
means of the attributes that we have used here. Moreover, our analysis
divided the four classes of nonradial pulsators along the main
sequence, i.e., the best targets for asteroseismology (see Aerts et
al. (2008) for a review), into four different clusters with the
appropriate properties according to the physics of these stars.

4.2. The OGLE archive 

Fig. 1. The clustering structure of the Hipparcos variable stars
archive at σ = 0.2. The x-axis represents the logarithm of the
frequency and the y-axis the V - I colour index. Black dots represent
the complete database and red crosses identify cluster members.

The OGLE LMC variability database was processed in the same way as the
Hipparcos archive and an equivalent dataset was generated with the
same attributes. However, the phase difference φ12 was not used in the
clustering experiments. While the use of φ12 was convenient to avoid a
strong contamination of the RR Lyrae and Cepheid clusters by a
particular type of eclipsing binaries for the Hipparcos database, the
denser sampling (43351 objects) defines the maxima of the density
distribution better in the OGLE LMC case and diminishes (but does not
eliminate) the contamination. Instead, the Wesenheit index
W_I = I - 1.55(V - I), which is a brightness indicator independent of
extinction, was added to the analysis since this quantity has been
used previously in the mining of the OGLE database.
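
The extinction independence of the Wesenheit index can be checked with a minimal sketch. It assumes that the coefficient 1.55 plays the role of the ratio A_I / E(V - I), so that reddening a star by E(V - I) dims the I band by 1.55 E(V - I); the function name and the magnitudes are invented for illustration.

```python
def wesenheit_index(i_mag, v_mag, colour_coefficient=1.55):
    """Wesenheit index W_I = I - 1.55 (V - I)."""
    return i_mag - colour_coefficient * (v_mag - i_mag)


# Invented intrinsic magnitudes of a star.
i_intrinsic, v_intrinsic = 15.0, 15.8

# Apply a colour excess E(V - I) = 0.2 under the assumed extinction law
# A_I = 1.55 E(V - I), hence A_V = A_I + E(V - I).
colour_excess = 0.2
a_i = 1.55 * colour_excess
a_v = a_i + colour_excess

w_intrinsic = wesenheit_index(i_intrinsic, v_intrinsic)
w_reddened = wesenheit_index(i_intrinsic + a_i, v_intrinsic + a_v)
```

Both values agree to rounding error, which is why the index can be used as a brightness attribute without a star-by-star extinction correction.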

The list of well identified clusters at the largest scales is similar
to the one described above for the Hipparcos database, except that the
OGLE data are less suited to find low-amplitude multiperiodic
variables. This is illustrated in Fig. 2 in the frequency-W_I plane,
and further clarified in other projections in Figs 27-40. Amongst the
largest clusters are those containing the First Overtone (FO) and
Fundamental Mode (FU) Cepheids (clusters 7 and 8, respectively), and
the RR Lyrae (cluster 6). The RR Lyrae cluster is further separated at
smaller scales into two large clusters corresponding i) to the ab
subtype, and ii) to a mixture of the c subtype and double mode
RR Lyrae pulsators (see Fig. 3), and several smaller clusters
containing mainly spurious detections.




The Long Period Variable sequences defined by Soszyński et al. (2005)
in the W_JK-log(P) plane (where the W_JK index is the equivalent of
the Wesenheit index for the J and Ks near-infrared photometric bands)
are grouped into four large categories. The Long Period Variables
region in Fig. 2 is shown again for clarity in Fig. 4. Each of these
clusters is further separated into smaller groups at finer scale
levels in the clustering hierarchy. In order of decreasing frequencies
(increasing periods), the first and largest cluster comprises the A
and B sequences (the so-called OGLE Small Amplitude Red Giant Stars,
OSARGS); the adjacent cluster corresponds to the C sequence (cluster
2; Mira and Semiregular variables) in the W_JK-log(P) plane. Fig. 4
shows that this cluster can be separated into several groups at
smaller bandwidth scales. The reason for this is that the unique C
sequence in the W_JK-log(P) plane splits into two when the Wesenheit
index W_I is used instead of W_JK, with each ridge characterized by a
different slope corresponding to the different chemical composition of
the stars (O-rich in the case of the larger frequency cluster and
C-rich in the other cluster). The C sequence in the W_JK-log(P) plane
is barely visible when the Wesenheit index is used, but the algorithm
detects a cluster (cluster 3) that coincides with the descriptions
found in Soszyński et al. (2005), except for the fact that it also
contains objects in the D sequence (those with redder V - I colours
and larger amplitudes). Cluster 4 also seems to correspond to objects
in the C sequence. Finally, the D sequence is clearly visible at the
smallest frequencies. As mentioned above, these groups (obtained with
a large kernel bandwidth) are further split into the well known Long
Period Variables sequences in subsequent levels of the hierarchy
obtained with smaller kernel bandwidths.

Fig. 3. The two largest subgroups in the RR Lyrae cluster, obtained
with a smaller kernel bandwidth (σ = 0.1) than in Fig. 2. They
correspond to the well known loci of the RRab (red dots) and RRc/RRd
(blue dots) subtypes. The x-axis represents the logarithm of the
frequency and the y-axis, the Wesenheit index W_I.

Fig. 2. The clustering structure of the OGLE LMC archive at σ = 0.15.
The x-axis represents the logarithm of the frequency and the y-axis,
the Wesenheit index W_I. Black dots represent the complete database
and red crosses identify cluster members. Note the occurrence of
spurious frequencies at a value of one per day and multiples thereof
due to the daily gaps in the data (one-day aliasing problem).
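
The one-day aliasing noted in the caption of Fig. 2 suggests a simple screening step, sketched below. This is an assumption for illustration rather than a step taken in this analysis: the function name, the tolerance of 0.01 cycles per day and the limit on the multiple are hypothetical choices.

```python
def is_daily_alias(frequency_per_day, tolerance=0.01, max_multiple=10):
    """Flag frequencies close to an integer number of cycles per day."""
    nearest = round(frequency_per_day)
    is_multiple = 1 <= nearest <= max_multiple
    return is_multiple and abs(frequency_per_day - nearest) <= tolerance


# Invented dominant frequencies in cycles per day.
candidate_frequencies = [1.003, 2.0, 1.37, 0.002]
alias_flags = [is_daily_alias(f) for f in candidate_frequencies]
```

Flagged objects would still need visual inspection, since a genuine variable can have a frequency close to one cycle per day.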

The eclipsing binaries and ellipsoidal variables are gathered in 5
clusters (clusters 9 to 13). Cluster 9 corresponds to detached
binaries of the EA and EB types, with flat light curves between
eclipses. As we go from cluster 9 to cluster 12, we see a sequence in
the shapes of the eclipsing binary light curves, in the sense that the
eclipse widths increase until there no longer exists a flat part of
the light curve between eclipses. Figure 41 shows several examples of
the light curves of objects closest to the corresponding cluster
modes. Following the results obtained by Sarro et al. (2009) regarding
the classification of hot main-sequence pulsators, some of these
objects may have ended up in these binary star clusters.

Fig. 4. The clustering structure of the OGLE LMC archive at σ = 0.15,
in the Long Period Variables region. The x-axis represents the
logarithm of the frequency and the y-axis, the Wesenheit index W_I.
Blue dots represent the A and B sequences (cluster 1), red dots
represent the C sequence (cluster 2), dark green corresponds to the C
sequence and the upper part of the D sequence (cluster 3) and orange
represents the D sequence (cluster 4).

Cluster 13 contains two subclusters that are easily separated at
smaller kernel bandwidths (see Fig. 5). One of them (cluster 13a,
shown in blue in Fig. 5) shows a clear correlation between the
frequency and the Wesenheit index, in agreement with the slope found
by Soszyński et al. (2004), while cluster 13b (shown in red in Fig. 5)
shows no correlation in the log(ν)-W_I plane. Visual inspection of the
light curves in this cluster seems to indicate a predominant component
of EW systems, although we also find eccentric ellipsoidals such as
the one shown in Fig. 6. The comparison of clusters 13a and 14 with
the population of variables discussed by Soszyński et al. (2004) seems
to suggest that these two clusters are mainly composed of ellipsoidal
variables in binary systems.



Fig. 5. The clustering structure of the OGLE LMC archive at σ = 0.3
(after renormalization to the unit hypercube) for cluster 10. The
x-axis represents the logarithm of the frequency, log(ν), and the
y-axis, the Wesenheit index W_I. It clearly shows how the cluster is
divided into two subgroups corresponding to the locus of ellipsoidal
variables (blue) and EW eclipsing binaries (red).



The last two clusters in Fig. 2 correspond to systems occupying the
same region of parameter space where the Hipparcos Be stars are found,
and correspond to stars having long periods and blue colours in the
LMC sample (see Fig. 30). Visual inspection of the light curves of
systems in the neighbourhood of the cluster mode indicates the
presence of nonperiodic behaviour with long time scales in cluster 15
and a mixture of nonperiodic and strictly periodic behaviour in
cluster 16.

Additional clusters not shown in Fig. 2 show interesting properties
worth further investigation. One of them (shown in Fig. 7) contains a
selection of candidates for the category of low amplitude periodic
variables (Debosscher et al. 2009). This figure shows their position
in two 2D projections of the parameter space, together with the
fundamental and first overtone Cepheid loci. Two of the candidates
(SC4 323401 and SC3 35239) belong to the list of ultra low amplitude
Cepheids in Buchler et al. (2005). Others may again represent some of
the candidate slowly pulsating B stars found from supervised
classification in Sarro et al. (2009).

Another cluster, shown in the log(ν)-(V - I) plane in Fig. 8, splits
into several subclusters at smaller kernel bandwidths, two of which
seem to group stars with properties typical of the δ Sct (shown as red
dots) and β Cep (in blue) types, respectively. These correspond to
some of the stars identified as main-sequence nonradial pulsators of
these types in Debosscher et al. (2009).




Fig. 6. Example of an ellipsoidal variable in an eccentric system
(cluster 13b). The light curve qualitatively resembles the synthetic
light curve computed for an eccentricity e = 0.2 and periastron
longitude p = 90 deg shown in Fig. 8 of Soszyński et al. (2004).



Fig. 8. Blue plus signs and red crosses represent the small scale
structure of one of the clusters found, not shown in Fig. 2. The x
axis corresponds to the logarithm of the frequency ν, and the y axis,
to the V - I colour index.



Finally, the cluster shown in Fig. 9 is located in the region occupied
by the various sequences of red giant variables discussed in Soszyński
et al. (2005) (see their Fig. 3), but does not correspond to any of
the sequences discussed therein. It shows a remarkable correlation in
the log(ν)-(V - I) plane that seems to continue the sequence of C-rich
red giants in the C sequence, although there also exists the
possibility that it represents a subpopulation of ellipsoidal
variables (shown in blue in Fig. 9).



We conclude that there is reasonable agreement between the
extractor-type and supervised classification results obtained for the
OGLE LMC database and our clustering results. The majority of
variables in the OGLE database are clearly large-amplitude
monoperiodic pulsators or binaries, but we do recover evidence for the
occurrence of some low-amplitude variables.

Fig. 7. Low amplitude periodic variable candidates in two dimensional
projections of the parameter space. In the left plot, the x axis
represents the logarithm of the frequency, and the y axis the
logarithm of the amplitude of the first harmonic component of the
Fourier decomposition. In the right hand plot, the x axis represents
the V - I colour index, and the y axis the Wesenheit index W_I. In
both plots, classical Cepheids pulsating in the fundamental mode or
the first overtone are represented with red dots, and the low
amplitude candidates with blue crosses.




Fig. 9. Red crosses represent stars in one of the clusters obtained,
showing a strong correlation in the log(ν)-W_I plane. The x axis
corresponds to the logarithm of the frequency ν, and the y axis, to
the Wesenheit index W_I. The region shown corresponds to the red giant
sequences, and blue crosses correspond to the cluster interpreted as
ellipsoidal variables in the text (shown here for comparison).



4.3. The CoRoT intermediate archive

In this section we show the results of the clustering analysis of the
first four runs of CoRoT exoplanet data (IRa01, LRa01, LRc01 and
SRc01; see Debosscher et al. (2009) for a description of the data).
The goal of the exoplanet programme of the CoRoT mission is to detect
planets around stars through the transit method (see "The CoRoT Book",
Fridlund et al. (2006), for more explanation). In order to achieve
this, light curves of thousands of stars with a precision typically a
factor of 100 better than that of the Hipparcos and OGLE data have
been gathered during long uninterrupted sequences (5 months for a long
run, a few weeks for a short run). The probability of finding
exoplanets is much higher for main-sequence stars than for evolved
stars. As such, the CoRoT exoplanet database is heavily biased towards
main-sequence stars, while giants and supergiants have been avoided in
the preselection of the targets as much as possible. It is therefore
evident that the populations of variable stars in the CoRoT and OGLE
databases are almost opposite in nature, with the Hipparcos variable
star database bridging these two extremes.



The analysis of the CoRoT dataset also differs from the previous two
in that no colour information was used. The CoRoT satellite is
equipped with a dispersing prism that provides information on colour
changes with time for a given star, but the masks are adapted to each
of the stars separately and hence this information cannot be
translated into any kind of standard colour index. Although there are
ground-based colour indices available for a fraction of the archive
from the CoRoT EXODAT database (Meunier et al. 2007), we preferred not
to use them for the clustering because they have not been corrected
for reddening. This is also the case for the 2MASS (Skrutskie et al.
2006) colour indices. Although reddening effects should be less
important in the infrared passbands, it has to be recalled that the
CoRoT fields are located near the Galactic center (LRc01 and SRc01)
and anticenter (IRa01, LRa01; see the 2MASS Explanatory Supplement to
the Second Incremental Data Release for a comparison between typical
colour-colour diagrams in Galactic plane and pole regions).
Nevertheless, we have conducted a parallel experiment (described in
section 4.3.1 below) in order to assess the robustness of our results
when 2MASS colours are used as attributes for clustering. Also, we
would like the results of the clustering analysis to be used directly
in the update and improvement of the supervised classification system,
which is based solely on CoRoT light curves (Debosscher et al. 2009).
The lack of colour information will necessarily cause confusion among
variability classes whose light curves are very similar. This is the
case, e.g., for β Cep and δ Sct stars, for SPBs and γ Dor stars, and
for various classes of stars with activity. We thus expect these to be
difficult to separate.

On the other hand, the lack of colour information is partially
compensated for by the unprecedented quality of the frequency
information that can be recovered from time series analysis of the
CoRoT exoplanet light curves. For most of the light curves, numerous
frequencies at low amplitudes are detected and thus we may hope to
discriminate easily between monoperiodic and multiperiodic
variability. This is in contrast to the Hipparcos and OGLE light
curves, which quite often show only one significant frequency. Hence,
for the CoRoT database, it is most instructive to look at the
clustering structure in the (ν1, ν2) and (ν1, A11) planes (Figs 10 and
11). We adopted the frequency threshold log(ν1) > -1.2 set in
Debosscher et al. (2009) in order to prevent long-term trends (of
intrinsic or instrumental nature) from dominating the frequency
derivation. We clearly see in Fig. 10 that many stars reside along the
bisectors in the subpanels, which points towards multiperiodic
variables. The most conspicuous OGLE clusters, corresponding to the
loci of the monoperiodic RR Lyrae stars, Cepheids, eclipsing binaries
and small amplitude red giants (more than 90% of the OGLE variables),
are absent here. The implication for the supervised classification of
these targets is straightforward: the performance of a classifier
optimized for the OGLE sample of variable stars cannot be expected to
be maintained when applied to such a different stellar population.
This is why the training set has been filled with CoRoT stars as much
as possible for the supervised classification in Debosscher et al.
(2009).

Fig. 10. The clustering structure of the CoRoT first four runs archive
at σ = 0.2. The x-axis represents the logarithm of the first frequency
and the y-axis, the logarithm of the second detected frequency. Black
dots represent the complete database and red crosses identify cluster
members.



The interpretation of the clusters shown in Figs 10 and 11 as well as
in Figs 42-48 is again based on their position in the various
projections of the parameter space, and also on the visual inspection
of the first tens or hundreds of time series and phase-folded light
curves. As already emphasized, we expect the interpretation of the
clusters to be more difficult than for the Hipparcos and OGLE cases,
as the most prominent clusters for these databases are absent here and
as we expect groups of stars with similar periods to be mixed due to
the lack of colour information.



The clusters shown in Figs 10 and 11 are sixteen of the largest
clusters found, although many less populated yet still significant
clusters are found beyond these. The reader will have noticed from the
subpanel labels that clusters 1, 3, 14 and 15 are absent. Visual
inspection of their light curves showed that these clusters contain
mostly spurious frequency detections due to jumps in the light curves
caused by hot-pixel spikes during the passage of the satellite through
the South Atlantic Anomaly. These effects are not intrinsic to the
stars, which is why these clusters are not discussed here. For the
time being we have no appropriate solution to this problem, although
automated jump correction and detrending methods as in Degroote et al.
(2009) seem a promising route for future applications before the
clustering analysis. We realise that these instrumental artifacts
sometimes occur together with an underlying low-amplitude periodic
signal and that the latter unfortunately is masked by the jumps in the
time series or vice versa. We thus expect our retained clusters to be
possibly contaminated by light curves with small instrumental jumps,
particularly for the clusters with periodicities of the order of days.

Let us look at the low-amplitude multiperiodic oscillators first. As
already mentioned, we expect them to populate the bisector regions in
Fig. 10 well above the point (log ν1, log ν2) = (-1, -1). This is the
case for clusters 4, 5, 8, 11, 16, and less clearly also clusters 7,
12, and 19. Clusters 4, 8, and 11 clearly cover the periodicities of
the β Cep/δ Sct range, while clusters 5, 7, 12, and 16 correspond to
frequencies in the SPB/γ Dor range. The dominant frequency of the
cluster 9 stars is also δ Sct-like.

Fig. 11. The clustering structure of the CoRoT archive at σ = 0.2. The
x-axis represents the logarithm of the frequency and the y-axis, the
logarithm of the amplitude A11 in magnitudes of the first component in
the Fourier decomposition for ν1. Black dots represent the complete
database and red crosses identify cluster members.
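
The expectation that multiperiodic oscillators populate the bisector of the (log ν1, log ν2) plane above the point (-1, -1) can be expressed as a simple selection, sketched here. It is an illustration only and not how the clusters were obtained: the function name and the half-width `max_offset` of the band around the bisector are hypothetical.

```python
import numpy as np


def near_bisector(log_nu1, log_nu2, min_log_frequency=-1.0, max_offset=0.3):
    """Select stars whose two dominant frequencies are comparable and
    both above the given threshold in log frequency."""
    log_nu1 = np.asarray(log_nu1, dtype=float)
    log_nu2 = np.asarray(log_nu2, dtype=float)
    close = np.abs(log_nu1 - log_nu2) <= max_offset
    above = (log_nu1 > min_log_frequency) & (log_nu2 > min_log_frequency)
    return close & above


# Invented (log nu1, log nu2) pairs for four stars.
selected = near_bisector([0.9, 0.8, -1.1, 0.5], [1.0, -1.0, -1.15, 0.45])
```

The second star in the example has a very low second frequency and is rejected; that configuration corresponds to the behaviour attributed to cluster 9 rather than to the bisector clusters.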



The class labels obtained with the supervised classifiers presented in
Debosscher et al. (2009) generally agree with these cluster
identifications based on their position in parameter space.

Based on the Initial Mass Function, δ Sct stars will predominate in
clusters 4, 8, 9, 11 over the β Cep population (a hypothesis not
contradicted by the EXODAT colours, which are uncorrected for
reddening). Clusters 4 and 9 gather the vast majority of δ Sct stars
(423 + 112 stars) in the first four runs. Clusters 8 and 11 contain
117 and 103 stars, respectively, and they show distinctive features in
several projections of the parameter space. Cluster 8 is characterized
by lower frequencies, larger amplitudes and larger R21 ratios than
clusters 4 and 9. We interpret this as the subgroup of δ Sct stars
with nonlinear light curve distortion due to high-amplitude
(non)radial modes. Cluster 11, on the contrary, seems to contain the
δ Sct stars with extremely low amplitudes. Cluster 9 is interpreted as
the group of δ Sct stars whose second frequency is very low. These are
probably hybrid δ Sct/γ Dor stars, δ Sct stars with rotational
modulation, or else δ Sct stars in binaries with grazing eclipses or
ellipsoidal variability. The latter two options are less likely given
that there is no specific phase relation for the second frequency
(Fig. 49). On the other hand, these stars clearly have higher values
of A22 (Fig. 46) and R21 (Fig. 47) compared to those quantities for
cluster 4 stars, pointing towards deviations from sinusoidal light
variability with frequency ν2. This behaviour of ν2 can be interpreted
in terms of rotational modulation or of nonlinear g-mode oscillations
triggered by resonant mode coupling (Buchler et al. 1997). It is very
difficult to discriminate between these two scenarios without
spectroscopic information, particularly since g modes may also be
split into multiplets by rotational effects on the oscillations.

Clusters 8 and 11 show a clear bimodality in the diagram of ν1 versus
colour (uncorrected for reddening) shown in Fig. 12, where the stars
of cluster 11 turn out to be somewhat bluer and those of cluster 8
redder than average. This independent information clearly suggests
that there is a physical reason why these clusters were found as
separate classes, rather than this being due to statistical noise in
the point process. In fact, the low-amplitude δ Sct stars in cluster
11 seem to coincide with the low-amplitude pulsators filling the gap
between the classical instability strip and the SPB strip found by
Degroote et al. (2009), which explains why they are somewhat bluer
than the majority of the δ Sct stars. On the other hand, the stars in
cluster 8 have clearly longer periods and are redder, while their
amplitudes are also somewhat higher than average. This
frequency-colour behaviour indicates that they are rather evolved
δ Sct stars. The time series of the objects closest to the cluster
modes are shown in Fig. 13 and are consistent with our interpretation.

Fig. 12. Frequency-colour diagrams of the low frequency multiperiodic
pulsator clusters. The left plot shows clusters 4 and 9 (black filled
squares), 8 (red circles), 11 (blue triangles), 5 (magenta diamonds),
7 (green circles), 12 (filled orange) and 16 (open orange). The right
panel shows the same plot with the logarithm of the amplitude of the
first Fourier component colour coded from red (low amplitudes) to
yellow (high amplitudes). The x axis represents the EXODAT B - V
colour index, and the y axis, the logarithm of the first detected
frequency.

Fig. 13. Close-up of the CoRoT photometric time series of the object
closest to clusters 4 (top left), 8 (bottom left), 9 (top right), and
11 (bottom right). The x axis represents the modified heliocentric
Julian Date, and the y axis, the signal measured in counts.

The differences amongst clusters 5, 7, 12, and 16 are mainly related
to the amplitudes of the first frequency. As shown in Fig. 11, cluster
7 corresponds to the largest amplitudes, cluster 5 stars also have
relatively high amplitudes, while clusters 12 and 16 show
significantly smaller amplitudes, and cluster 16 also somewhat higher
frequencies. Some of the light curves in the vicinity of the modes of
clusters 12 and 16 have low signal-to-noise ratios. Moreover, cluster
16 is not only characterized by higher mean frequencies, but also by
lower mean values of R21, which implies that the oscillations in that
group of stars are highly linear, while modest nonlinear effects seem
to occur for the stars in clusters 5, 7, and 12. All these properties
suggest that clusters 5, 7, 12, and 16 contain SPB and γ Dor stars,
which are both classes of multiperiodic gravity-mode nonradial
oscillators. The time series of the four objects closest to the
cluster 5 (top left), 7 (bottom left), 12 (top right) and 16 (bottom
right) modes are shown in Fig. 14 and support this interpretation.
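
The link between a low R21 and linear (sinusoidal) oscillations can be made concrete with a small sketch. R21 is not defined in this section; the sketch assumes it is the ratio of the amplitude of the first overtone harmonic (2ν1) to that of the fundamental (ν1) in a harmonic least-squares fit. The function names and the synthetic light curves are invented for illustration.

```python
import numpy as np


def harmonic_amplitudes(times, flux, frequency, n_harmonics=2):
    """Fit a constant plus sine/cosine terms at k * frequency
    (k = 1..n_harmonics) and return the amplitude of each harmonic."""
    times = np.asarray(times, dtype=float)
    flux = np.asarray(flux, dtype=float)
    columns = [np.ones_like(times)]
    for k in range(1, n_harmonics + 1):
        arg = 2.0 * np.pi * k * frequency * times
        columns.append(np.sin(arg))
        columns.append(np.cos(arg))
    design = np.column_stack(columns)
    coeffs, *_ = np.linalg.lstsq(design, flux, rcond=None)
    # coeffs = [constant, sin_1, cos_1, sin_2, cos_2, ...]
    return np.hypot(coeffs[1::2], coeffs[2::2])


def amplitude_ratio_r21(times, flux, frequency):
    """Assumed definition: amplitude at 2*frequency over amplitude at frequency."""
    amps = harmonic_amplitudes(times, flux, frequency, n_harmonics=2)
    return float(amps[1] / amps[0])


# Synthetic light curves sampled over ten days (invented values).
t = np.linspace(0.0, 10.0, 400)
nu = 0.7  # cycles per day
distorted = np.sin(2 * np.pi * nu * t) + 0.25 * np.sin(2 * np.pi * 2 * nu * t + 0.3)
sinusoidal = np.sin(2 * np.pi * nu * t)

r21_distorted = amplitude_ratio_r21(t, distorted, nu)
r21_sinusoidal = amplitude_ratio_r21(t, sinusoidal, nu)
```

A purely sinusoidal signal yields a ratio close to zero, whereas the distorted signal returns the ratio of its two input amplitudes.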



The position of the stars in these four clusters in the colour diagram
(uncorrected for reddening) shown in Fig. 12 seems to suggest that
clusters 5, 12, and 16 contain a mixture of SPB and γ Dor stars, with
the majority being γ Dor stars, again based on the Initial Mass
Function. The stars have essentially the same colour properties, which
confirms that the bluer SPBs must be fewer in number than the redder
γ Dor stars. The stars in cluster 7 are clearly redder and have longer
periods connected with their higher amplitudes. Again, we interpret
this as an evolutionary effect and thus this cluster contains the
evolved γ Dor stars. A striking feature in Fig. 12 is the presence of
some ten stars of cluster 12 and one of cluster 7 with far redder
colours than all others in the four clusters with g-mode pulsators.
While we have to be careful not to overinterpret Fig. 12 due to the
lack of an appropriate correction for reddening effects, it seems to
suggest that these stars are much more evolved than the other g-mode
pulsators. It could be that this small group of stars represents the
PVSG class, which are evolved B stars with SPB-like oscillations with
longer periods (Lefever et al. 2007).

We cannot exclude that clusters 8 and 7 are contaminated by a few
pre-main-sequence δ Sct and γ Dor stars, respectively, which would
still be surrounded by remnant material of their birth cloud and which
would be an alternative explanation for their redder colour.

Finally, we point out that the frequency-colour behaviour displayed in
Fig. 12 is in full agreement with the theoretical predictions of the
instability strips of the δ Sct and γ Dor stars, along with a small
number of pulsating B stars, as presented in, e.g., Degroote et al.
(2009).

Fig. 14. Close-up of the CoRoT photometric time series of the object
closest to clusters 5 (top left), 7 (bottom left), 12 (top right), and
16 (bottom right). The x axis represents the modified heliocentric
Julian Date, and the y axis, the signal measured in counts.



Several clusters contain stars whose variability patterns we interpret
in terms of phenomena related to stellar activity. The largest cluster
(cluster 2) is composed (again, according to the time series of the
first tens of objects closest to the cluster mode) of stars showing
time series characterized by the occurrence of harmonics of the
dominant frequencies, which points to strongly non-sinusoidal light
curves. Often, this can be explained in terms of one or several
starspots that can migrate in phase with respect to one another. A
typical light curve corresponding to the objects closest to the
cluster mode is shown in Fig. 15.

Cluster 17 contains the most extreme examples of strict rotational
modulation active in the stellar photospheres, in the sense of being
characterized by some of the largest amplitudes in the CoRoT sample as
well as very high R21-values (see Figs 47 and

Clusters 6 and 10 show similar behaviour but with an interesting
peculiarity: the variability of these stars is systematically
described by two very different dominant frequencies (see Fig. 10).
Cluster 6 shows clear signs of bimodality, something that is confirmed
by clustering analysis with smaller kernel bandwidths: it is composed
of two populations, one with similar values of the first two detected
frequencies, interpreted as complex activity, and one characterized by
values of log(ν1) between -1 and 1, and typical values of log(ν2) less
than -1.0. The latter stars are interpreted as active stars whose
light curves also show long-term trends. Fig. 50 shows close-ups of
time series in the vicinity of the modes of cluster 6. We can see very
conspicuous low frequency modulations superposed on an activity
signature.

Cluster 10 is also bimodal at smaller kernel bandwidths, but both
subclusters are characterized by log(ν2) values around -1.0, i.e.,
long-term trends, while being bimodal in their log(ν1), with values
around -0.15 and 0.8, respectively. It may very well be that this
group consists of pulsating Be stars already discussed above in the
framework of the Hipparcos database. These stars indeed undergo trends
due to some level of activity and/or outbursts, while also showing
oscillations with frequencies of pressure modes or gravity modes. Fig.
51 shows close-ups of time series in the vicinity of the modes of
cluster 10. These light curves are free from significant jumps. It has
to be borne in mind that, in these clusters, it is the high frequency
component that is detected first in the power spectrum, and it is thus
associated with a higher peak. Future spectroscopic data are needed to
refine this preliminary interpretation of cluster 10.

Fig. 15. Close-up of the CoRoT photometric time series of the object
closest to the cluster 2 mode. The x axis represents the modified
heliocentric Julian Date, and the y axis, the signal measured in
counts.



The majority of eclipsing binaries are collected in clusters
18 and 20 (and in clusters 21 and 22, plots not shown). This can be
readily deduced from the phase behaviour of the dominant frequency,
as illustrated in Fig. 48. It turns out that the object closest to the
mode of cluster 18 is a binary system (or a blend within the CoRoT
mask) with a multiperiodic component (see Fig. 53). In general,
we found a few eclipsing binaries with pulsating components
in the clusters with pulsators discussed above. The majority
of stars in the four binary clusters 18, 20, 21, and 22 show only
eclipses, although some also have a signature of rotational
modulation and/or activity outside the eclipses.

Finally, cluster 19 is a mixture of binaries, stars with activity, 
and some seemingly multiperiodic stars, while cluster 13 con- 
tains a mixture of stars with artificial jumps. 
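
The dependence of the number of detected modes on the kernel bandwidth, invoked above for clusters 6 and 10, can be illustrated with a one-dimensional Gaussian kernel density estimate. The sketch below is not the clustering algorithm used in this work; the function `kde_modes`, the synthetic sample (two groups centred on the log(ν1) values -0.15 and 0.8 quoted for cluster 10, with an arbitrary width), and the two bandwidths are assumptions chosen for illustration.

```python
import numpy as np

def kde_modes(x, bandwidth, grid):
    """Return the grid positions of the local maxima of a Gaussian
    kernel density estimate of the sample x."""
    z = (grid[:, None] - x[None, :]) / bandwidth
    norm = x.size * bandwidth * np.sqrt(2.0 * np.pi)
    density = np.exp(-0.5 * z**2).sum(axis=1) / norm
    inner = density[1:-1]
    is_peak = (inner > density[:-2]) & (inner > density[2:])
    return grid[1:-1][is_peak]

# Hypothetical sample: two equally populated groups of log(nu1) values.
offsets = np.linspace(-0.2, 0.2, 101)
log_nu1 = np.concatenate([-0.15 + offsets, 0.8 + offsets])
grid = np.linspace(-3.0, 4.0, 1401)

modes_narrow = kde_modes(log_nu1, 0.1, grid)  # small bandwidth
modes_wide = kde_modes(log_nu1, 0.8, grid)    # large bandwidth
print(len(modes_narrow), len(modes_wide))
```

With these assumed values, the narrow bandwidth resolves the two groups as separate modes, whereas the wide bandwidth merges them into a single mode, which is the behaviour described for the bimodal clusters.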
4.3.1. Clustering with 2MASS colours

We have conducted two complementary experiments in order to
evaluate the information that the 2MASS colour indices could
potentially provide to the clustering process. First, we analysed
the clustering structure described above in the light of the
infrared properties of the CoRoT targets (i.e., we retain the clusters
found above and analyse the characteristic colours of the stars
in these clusters). Figure 16 shows the colour-colour diagram of
the CoRoT sample, with the cluster membership described in the
previous section superimposed as red symbols.

It is evident from the plots that the reddening has smeared out
the underlying distribution (see the Galactic pole colour-colour
diagram in the Explanatory Supplement to the 2MASS Second
Release). Nevertheless, Fig. 17 shows some interesting features
that arise when the 2MASS colour indices are interpreted in the
light of the frequency content of the light curves. First of all,
contamination of cluster 2 by spurious frequency detections due
to discontinuities in the time series is apparent as vertical lines
in the corresponding plot. Second, the bimodality in frequency
space of clusters 6 and 10 (described above) seems to correlate
well with the J - H colour index, in the sense that the subgroups
with smaller frequencies correspond partly to a distinct group of
redder, most probably giant, stars.

In the second experiment, we carried out a new clustering
of the CoRoT database, but this time we extended it with the
2MASS colour indices. We then studied the new groups of stars
in relation to the clusters obtained without 2MASS data. The
results can be summarized as follows:

- The clusters interpreted as δ Scuti and β Cephei stars (4, 8,
  9, and 11) remain easily recognizable clusters in the new
  clustering structure. Cluster 9 is split into two clusters, and
  the other three are preserved, with a clear tendency for
  cluster 4 to absorb a significant fraction of clusters 8 and 11. The
  original differences between these clusters (described
  thoroughly in the previous section) seem to be smeared out by
  the use of the 2MASS colours. Little or no contamination is
  observed from other clusters;

- stars that belonged to the clusters interpreted as γ Doradus
  and Slowly Pulsating B stars (5, 7, 12, and 16) remain
  grouped in a number of clearly identifiable clusters, and are
  not contaminated by other clusters. There is again a tendency
  for the clusters to merge into one larger cluster, except for
  cluster 7, which remains largely separated;

- the clusters numbered 1, 2, and 3 in the clustering experiment
  without 2MASS colours remain the three largest groups
  when the latter are taken into account, but the contingency
  table (the table that summarizes the number of stars previously
  in cluster i and now in cluster j) shows a major redistribution
  of stars in these categories. We interpret this as a hint that
  objects in clusters 1 and 3 should not be automatically
  disregarded as "artifacts". Further studies are needed to better
  separate these from real signals of astrophysical origin;

- the clearly bimodal cluster 6 splits into two larger clusters
  (interpreted as dwarfs and giants in the paragraph above)
  and two smaller ones; a similar behaviour is observed in
  cluster 10, with some stars merging into one of the smaller
  clusters originating from cluster 6;

- finally, the eclipsing binaries remain clearly separated as
  groups, with minor redistributions of stars among clusters.

5. Conclusions

We presented the results of extensive clustering experiments
with three variability databases of very different characteristics,
using a density-based approach. We concentrated on the
large-scale clustering properties of the samples, trying to understand
what domain of the variability zoo can be recovered with this
kind of multivariate analysis. While the Hipparcos archive is
characterized by a small number of objects with reliable class
assignments that simplify the interpretation of the clusters, the
CoRoT database is only starting to be understood, owing to its very
high level of precision.

The results presented here already constitute a clear-cut
separation of variability types, as we understand them on
theoretical grounds. The clusters discussed above have been
interpreted on the basis of their average properties and of the visual
inspection of those objects close to the cluster modes. Since
these modes are located at the regions of maximum probability
density, these objects are the prototypes of the clusters, but
significant contamination from a variety of other objects cannot be
ruled out, and should, in fact, be assumed. This is mainly the
case for those types where the attributes used to describe the
objects are insufficient for a unique description of the time series.
Nevertheless, the existence of well-defined clusters containing
several types of light curves with the imprint of nonradial
oscillations, rotational modulation, activity, binarity, etc., in the CoRoT
database encourages the use and improvement of the current
attribute set to interpret the detailed physics of the clusters.
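
The role of the cluster modes as prototypes can be made concrete with a small mode-seeking example. The sketch below uses a plain Gaussian mean-shift iteration as a stand-in for a density-based mode finder (it is not claimed to be the algorithm applied in this work) and then lists the objects nearest to the mode, mimicking the selection of light curves for visual inspection. All names (`mean_shift_mode`, `nearest_objects`), the synthetic two-attribute sample, and the bandwidth are hypothetical.

```python
import numpy as np

def mean_shift_mode(points, start, bandwidth, n_iter=500, tol=1e-10):
    """Climb to a local maximum of the Gaussian kernel density of `points`
    by repeatedly moving to the kernel-weighted mean of the sample."""
    x = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        sq_dist = np.sum((points - x) ** 2, axis=1)
        weights = np.exp(-0.5 * sq_dist / bandwidth**2)
        new_x = weights @ points / weights.sum()
        shift = np.linalg.norm(new_x - x)
        x = new_x
        if shift < tol:
            break
    return x

def nearest_objects(points, mode, k):
    """Indices of the k objects closest to the mode (Euclidean distance)."""
    dist = np.linalg.norm(points - mode, axis=1)
    return np.argsort(dist)[:k]

# Hypothetical sample: two compact groups in a two-attribute space.
offsets = np.linspace(-0.3, 0.3, 7)
gx, gy = np.meshgrid(offsets, offsets)
blob = np.column_stack([gx.ravel(), gy.ravel()])
points = np.vstack([blob + [1.0, -2.0], blob + [4.0, 1.0]])

mode = mean_shift_mode(points, start=[1.2, -1.9], bandwidth=0.2)
prototypes = nearest_objects(points, mode, k=5)
print(mode, prototypes)
```

In this toy setting the iteration settles on the centre of the first group, and the listed indices are the objects one would inspect first; objects farther from the mode are those for which contamination is more likely.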

We used the Hipparcos and OGLE LMC archives in order
to define a reference frame that helps both the interpretation of
the CoRoT database clustering structure and the discovery of
new classes in it. In the Hipparcos archive, we clearly managed
to separate the classical pulsators (Cepheids and RR Lyrae
stars) from the eclipsing binaries and from the multiperiodic
low-amplitude pulsators along the main sequence. This was
successful thanks to the availability of a well-calibrated colour
index. In the LMC case, the use of reddening-free attributes like
the Wesenheit index allows for the recovery of the various
sequences of red giants in the long-period range of the archive.
Furthermore, the ellipsoidal variables are easily recognized in
the OGLE clusters thanks to the log(ν)-W_I correlation that they
follow.
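
For reference, a reddening-free Wesenheit index of the kind mentioned above is commonly built from the I magnitude and the V - I colour. The sketch below assumes the form W_I = I - 1.55 (V - I), with the coefficient 1.55 being a value often adopted for OGLE photometry; neither the coefficient nor the function name `wesenheit_index` is taken from this section, and both should be checked against the definition used for the LMC attributes.

```python
def wesenheit_index(i_mag, v_mag, coefficient=1.55):
    """Assumed form W_I = I - coefficient * (V - I), where the coefficient
    plays the role of the extinction ratio A_I / E(V-I)."""
    return i_mag - coefficient * (v_mag - i_mag)

# Hypothetical star: intrinsic magnitudes and an arbitrary colour excess.
i0, v0 = 15.0, 15.8
excess = 0.2              # E(V-I), illustrative value
a_i = 1.55 * excess       # extinction in I implied by the assumed ratio
a_v = a_i + excess        # extinction in V, since E(V-I) = A_V - A_I

w_intrinsic = wesenheit_index(i0, v0)
w_reddened = wesenheit_index(i0 + a_i, v0 + a_v)
print(w_intrinsic, w_reddened)
```

Under the assumed extinction ratio, the index of the reddened star equals that of the unreddened one, which is why such an attribute remains useful when individual reddenings are unknown.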

For the CoRoT exoplanet database, we have shown that we
are capable of discovering a variety of properties among
the low-amplitude multiperiodic stars. Although the confirmation
of the various proposed subclasses among known nonradial
pulsators and stars with activity must await the collection
of complementary information, likely in the form of spectra, we
are capable of discovering refinements of the properties of
globally understood variability. Such refinements are only possible
thanks to the large-scale clusters unravelled in the data.

The unprecedented quality of the time series provided by
CoRoT has opened a completely new realm of variability, where
low-amplitude signals constitute new clusters by themselves, or
appear as second or third frequencies combined with
large-amplitude ones, sometimes giving rise to a separate cluster.

One of the most interesting findings is the clear separability
of the subpopulations among the previously known classes of
nonradial pulsators along the main sequence. Prior to the CoRoT
mission, we had only been able to hint at the properties of larger
versus lower amplitude members of the real distribution of these
objects in parameter space, as well as at the suspicion of hybrid
pulsators showing pressure and gravity modes simultaneously,
for the β Cep/Be versus SPB classes on the one hand, and for
the δ Sct versus γ Dor stars on the other hand. Also, the
clustering pointed towards pulsating stars of various kinds in close
binaries. Completely new samples of these pulsators will soon be
publicly available to the scientific community for further studies,
to be complemented with other sources of information.

Fig. 16. 2MASS colour-colour diagrams of the CoRoT clusters obtained with σ = 0.2 and no colour index used as attribute. The x-
and y-axes represent the J - K and J - H colour indices, respectively. Black dots represent the entire sample and red symbols correspond
to cluster members.

All this knowledge gained with the unsupervised classifi- 
cation of the CoRoT database needs to be incorporated in the 
framework of the supervised classifiers, especially in the case of 
the light curve characteristics due to rotational modulation and 
stellar activity. These variables turn out to represent a very sig- 
nificant fraction of the database while they are not sufficiently 
well represented in the training set used for the supervised
classification (Debosscher et al. 2009).

A systematic analysis of the smallest clusters, as well as the
decomposition into finer scales, is left for future work. In the
case of the CoRoT archive, this further analysis will be
complemented with data from new runs, which will considerably enhance
the density clumps at the base of the method. This will greatly
facilitate the discovery of potential new (sub)classes of variability.
We also stress that improvements in the handling and removal of the
instrumental artifacts in the CoRoT photometric time series, which
occurred for a large number of objects in the presented samples,
will reveal other potential new types of variability not found here
and will hopefully further improve the already very good
performance of our algorithm.

Finally, even though the main aim of this work was the
preparation for the analysis of the CoRoT database, we believe
that the results found in the LMC dataset justify a comparative
analysis of the OGLE Galactic Bulge data using the same
approach.
Fig. 18. The clustering structure of the Hipparcos archive at σ = 0.2. The x-axis represents the logarithm of the frequency and
the y-axis, the logarithm of the amplitude of the first component in the Fourier decomposition. Black dots represent the complete
database and red crosses identify cluster members.

Fig. 19. The clustering structure of the Hipparcos archive at σ = 0.2. The x-axis represents the logarithm of the frequency and the
y-axis, the logarithm of the R21 ratio between the amplitudes of the first two components in the Fourier decomposition. Black dots
represent the complete database and red crosses identify cluster members.

Fig. 20. The clustering structure of the Hipparcos archive at σ = 0.2. The x-axis represents the logarithm of the frequency and the
y-axis, the phase difference φ12 between the first two components in the Fourier decomposition. Black dots represent the complete
database and red crosses identify cluster members.

Fig. 21. The clustering structure of the Hipparcos archive at σ = 0.2. The x-axis represents the logarithm of the amplitude of the
first component in the Fourier decomposition, and the y-axis, the logarithm of the R21 ratio between the amplitudes of the first two
components. Black dots represent the complete database and red crosses identify cluster members.

Fig. 22. The clustering structure of the Hipparcos archive at σ = 0.2. The x-axis represents the logarithm of the amplitude of the first
component in the Fourier decomposition, and the y-axis, the phase difference φ12 between the first two components in the Fourier
decomposition. Black dots represent the complete database and red crosses identify cluster members.

Fig. 23. The clustering structure of the Hipparcos archive at σ = 0.2. The x-axis represents the logarithm of the amplitude of the
first component in the Fourier decomposition, and the y-axis, the V - I colour index. Black dots represent the complete database
and red crosses identify cluster members.

Fig. 24. The clustering structure of the Hipparcos archive at σ = 0.2. The x-axis represents the logarithm of the R21 ratio between
the amplitudes of the first two components of the Fourier decomposition, and the y-axis, the phase difference φ12 between the first
two components in the Fourier decomposition. Black dots represent the complete database and red crosses identify cluster members.

Fig. 25. The clustering structure of the Hipparcos archive at σ = 0.2. The x-axis represents the logarithm of the R21 ratio between
the amplitudes of the first two components of the Fourier decomposition, and the y-axis, the V - I colour index. Black dots represent
the complete database and red crosses identify cluster members.

Fig. 26. The clustering structure of the Hipparcos archive at σ = 0.2. The x-axis represents the phase difference φ12 between the first
two components in the Fourier decomposition, and the y-axis, the V - I colour index. Black dots represent the complete database
and red crosses identify cluster members.

Fig. 27. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the logarithm of the frequency and
the y-axis, the logarithm of the amplitude of the first component in the Fourier decomposition. Black dots represent the complete
database and red crosses identify cluster members.

Fig. 28. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the logarithm of the frequency and
the y-axis, the logarithm of the R21 ratio between the amplitudes of the first two components in the Fourier decomposition. Black
dots represent the complete database and red crosses identify cluster members.



Sarro et al.: Comparative clustering analysis of variable stars, Online Material p 13 



[Plot panels of Fig. 29 not reproduced; x-axis: log(ν1), with ticks from -3 to 1.]



Fig. 29. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the logarithm of the frequency
and the y-axis, the phase difference φ12 between the first two components in the Fourier decomposition. Black dots represent the
complete database and red crosses identify cluster members.

Fig. 30. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the logarithm of the frequency and
the y-axis, the V - I colour index. Black dots represent the complete database and red crosses identify cluster members.

Fig. 31. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the logarithm of the amplitude of the
first component in the Fourier decomposition, and the y-axis, the logarithm of the R21 ratio between the amplitudes of the first two
components. Black dots represent the complete database and red crosses identify cluster members.

Fig. 32. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the logarithm of the amplitude of
the first component in the Fourier decomposition, and the y-axis, the phase difference φ12 between the first two components in the
Fourier decomposition. Black dots represent the complete database and red crosses identify cluster members.

Fig. 33. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the logarithm of the amplitude of
the first component in the Fourier decomposition, and the y-axis, the V - I colour index. Black dots represent the complete database
and red crosses identify cluster members.

Fig. 34. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the logarithm of the amplitude of the
first component in the Fourier decomposition, and the y-axis, the Wesenheit index W_I. Black dots represent the complete database
and red crosses identify cluster members.

Fig. 35. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the logarithm of the R21 ratio between
the amplitudes of the first two components of the Fourier decomposition, and the y-axis, the phase difference φ12 between the first
two components in the Fourier decomposition. Black dots represent the complete database and red crosses identify cluster members.

Fig. 36. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the logarithm of the R21 ratio between
the amplitudes of the first two components of the Fourier decomposition, and the y-axis, the V - I colour index. Black dots represent
the complete database and red crosses identify cluster members.

Fig. 37. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the logarithm of the R21 ratio
between the amplitudes of the first two components of the Fourier decomposition, and the y-axis, the Wesenheit index W_I. Black
dots represent the complete database and red crosses identify cluster members.

Fig. 38. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the phase difference φ12 between
the first two components in the Fourier decomposition, and the y-axis, the V - I colour index. Black dots represent the complete
database and red crosses identify cluster members.

Fig. 39. The clustering structure of the OGLE LMC archive at σ = 0.15. The x-axis represents the phase difference φ12 between
the first two components in the Fourier decomposition, and the y-axis, the Wesenheit index W_I. Black dots represent the complete
database and red crosses identify cluster members.
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Fig. 41. Example light curves from objects in clusters 9 (top left), 10 (bottom left), 11 (top right), and 12 (bottom right), of the 
OGLE LMC sample. 
Fig. 42. The clustering structure of the CoRoT first four runs archive at σ = 0.2. The x-axis represents the logarithm of the first
frequency and the y-axis, the logarithm of the amplitude of the second Fourier component of the first detected frequency. Black dots
represent the complete database and red crosses identify cluster members.

Fig. 43. The clustering structure of the CoRoT first four runs archive at σ = 0.2. The x-axis represents the logarithm of the first
frequency and the y-axis, the logarithm of the amplitude of the third Fourier component of the first detected frequency. Black dots
represent the complete database and red crosses identify cluster members.

Fig. 44. The clustering structure of the CoRoT first four runs archive at σ = 0.2. The x-axis represents the logarithm of the first
frequency and the y-axis, the logarithm of the amplitude of the fourth Fourier component of the first detected frequency. Black dots
represent the complete database and red crosses identify cluster members.

Fig. 45. The clustering structure of the CoRoT first four runs archive at σ = 0.2. The x-axis represents the logarithm of the second
frequency and the y-axis, the logarithm of the amplitude of the first Fourier component of the second detected frequency. Black dots
represent the complete database and red crosses identify cluster members.

Fig. 46. The clustering structure of the CoRoT first four runs archive at σ = 0.2. The x-axis represents the logarithm of the second
frequency and the y-axis, the logarithm of the amplitude of the second Fourier component of the second detected frequency. Black
dots represent the complete database and red crosses identify cluster members.

Fig. 47. The clustering structure of the CoRoT first four runs archive at σ = 0.2. The x-axis represents the logarithm of the first
frequency and the y-axis, the logarithm of the R21 ratio between the amplitudes of the first two components of the first frequency in
the Fourier decomposition. Black dots represent the complete database and red crosses identify cluster members.

Fig. 48. The clustering structure of the CoRoT first four runs archive at σ = 0.2. The x-axis represents the logarithm of the first
frequency and the y-axis, the phase difference φ12 between the first two components in the Fourier decomposition of the first
frequency. The φ12 attribute was not used for clustering and it is only used here for illustration purposes. Black dots represent the
complete database and red crosses identify cluster members.

Fig. 49. The clustering structure of the CoRoT first four runs archive at σ = 0.2. The x-axis represents the logarithm of the second
frequency and the y-axis, the phase difference φ12 between the first two components in the Fourier decomposition of this second
frequency. The φ12 attribute was not used for clustering and it is only used here for illustration purposes. Black dots represent the
complete database and red crosses identify cluster members.
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Fig. 50. Close-up of the CoRoT photometric time series of the objects closest to the cluster 6 mode. The x axis represents the 
modified heliocentric Julian Date, and the y axis, the signal measured in units of 10^ counts. 
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Fig. 51. Close-up of the CoRoT photometric time series of the objects closest to the cluster 10 mode. The x axis represents the 
modified heliocentric Julian Date, and the y axis, the signal measured in units of 10^ counts. 
Fig. 52. Close-up of the CoRoT photometric time series of the object closest to the cluster 17 mode. The x axis represents the
modified heliocentric Julian Date, and the y axis, the signal measured in counts.

Fig. 53. Close-up of the CoRoT photometric time series of the object closest to the cluster 18 mode (eclipsing binaries). The x axis
represents the modified heliocentric Julian Date, and the y axis, the signal measured in counts.



